ABSTRACT


Supercomputers capable of performing extremely high
speed computation have been proposed which are based on an
architecture known as data flow. Application of a Petri
net-based methodology is used to evaluate the performance
attainable by such an architecture. The architecture
evaluated is MIT's cell block data flow architecture which
is being developed to execute the applicative programming
language VAL.

Results show that for the data flow architecture to
achieve its goal of high speed computation, intelligent
multiprogramming schemes need to be developed. One such
scheme, based on the notion of a "concurrency vector", is
introduced.
I. INTRODUCTION


A. BACKGROUND 


Despite the orders-of-magnitude increase in computation
speed that has occurred since the early 1950's, the need
still exists today for faster computers. This need is most
critical in the area of scientific computing, where there
exist computations requiring on the order of a billion
floating point operations per second [DENNIS, 1980].

One approach to achieving higher computation speed is to
increase the speed of the basic logic devices of the
computer. This approach, effective in the past, faces
significant obstacles to future gains because of the speed
of light limitation to signal propagation and limitations in
the integrated circuit manufacturing process.

A second approach to achieving higher computation speed
is through the exploitation of parallelism which is (or can
be) inherent in algorithms used to solve a wide range of
scientific problems. Such parallelism can be present at both
the operation and procedure levels in a program. Thus far,
such exploitation of parallelism has not reached a limiting
threshold to faster computation.

Data flow computing has been proposed as a conceptually
viable method of achieving higher computational speed
through greater exploitation of inherent algorithmic
parallelism. A computer based on the data flow concept
executes an instruction when its operands become available.
No sequential control flow notion exists. Data flow programs
are free of sequencing constraints except those imposed by
the flow of operands between instructions. Thus, a data flow
computer contrasts fundamentally with the "von Neumann"
model. Even so, the data flow concept is capable of
incorporating into one system all the known forms of
parallelism exploitation including vectorization,
pipelining, and multiprocessing.


B. RESEARCH APPROACH 

It was the purpose of this research to gain insights
into the degree of parallelism exploitation obtainable with
the data flow-based high speed computation method. The
classical issues of hardware utilization, program execution
time, and degree of multiprogramming were investigated in
the context of data flow. Application of an existing Petri
net-based methodology was the technique used to gain these
insights.

The hypothesis for this research had two parts. First,
the suitability of the Petri net-based Requester-Server
methodology for prediction of the performance of data flow
machines in an efficient, accurate manner was to be
explored. Second, a challenge to the data flow concept was
made. It was hypothesized that the goal of achieving higher
speed computation through data flow computing is
unattainable without achieving a high and "intelligent"
degree of multiprogramming. By "intelligent" it is meant
that the mapping of processes onto the hardware shall have
to be done in a near optimal fashion, defined in terms of
hardware utilization and program execution time.

To explore the two-part hypothesis, sets of Petri net
models of data flow programs, characterized by a range of
inherent parallelism, were "executed" on Petri net models of
the Dennis-Misunas data flow hardware design [DENNIS, AUG
1974], using the methodology called Requester-Server [COX,
1978]. The hardware models were varied in the number of
processing elements available for use in executing the
sets of program models. Thus software models were "run" on
hardware models, and appropriate performance indices were
measured and analyzed.

This research is important because it suggests a method
for mapping data flow programs onto the data flow machine to
achieve the desired degree of high speed computation.
Additionally, the Requester-Server (R-S) methodology has
been shown to be an effective tool for predicting the
performance of data flow computer architectures.

Grateful acknowledgment is made to L. A. Cox, designer
and initial implementer of the R-S methodology, and to D. M.
Stowers, who modified the R-S software to enable it to run
on the PDP-11/50 minicomputer at NPS.





C. ORGANIZATION 

The results reported here are organized in a fashion
conducive to the communication of experimental computer
science endeavors. Following a review of the applicable
literature (Section II), the hypothesis (Section III) is
presented. Next, the method used to test the hypothesis is
presented in detail (Section IV). This section discusses the
experimental design which includes the identification of
independent and dependent variables, characterizes and
explains the Petri net definition of the data flow hardware
and software, and ends with an account of the procedure used
to implement the experiment to test the hypothesis. Results
of the experiment and a discussion thereof are covered in
Section V. This section, in addition to demonstrating the
suitability of the R-S technique, and exploring the
multiprogramming response of data flow architectures,
presents some unexpected findings. Section VI summarizes the
entire research effort, including the results. Finally,
Section VII presents recommendations for further
investigation in the area of data flow research.





II. LITERATURE REVIEW 


A. APPROACHES TO PARALLELISM 

In general, computer science literature approaches the
concept of parallelism exploitation from either an
architecture (hardware) or language (software) point of
view. In this thesis, "parallelism" shall be viewed as
existing at many hierarchical levels within algorithms. Any
of several different computer architectures may be capable
of exploiting the parallelism which exists at one or more of
these various hierarchical levels. As should be expected,
each architecture is best-suited at exploiting inherent
algorithmic parallelism at a particular hierarchical level,
but not at others. In contrast, implementation of the data
flow concept proposes to exploit inherent algorithmic
parallelism at all hierarchical levels, in an efficient
fashion. Before presenting the concept of data flow, a
review of the range of architectures and strategies
currently used to exploit parallelism is presented.

In an early study of high speed computer architectures
[FLYNN, 1966] a four element taxonomy was developed which
classified computer systems in terms of the amount of
parallelism in their instruction streams and data streams
(see figure II.A.1). (An instruction stream is the series of
operations used by the processor; a data stream is the
series of operands used by the processor.) The first element
of this taxonomy, depicted in figure II.A.1(a), is the
serial computer which executes one instruction at a time,
affecting at most one data item at a time. Such a serial
machine is denoted as a single-instruction
single-data-stream (SISD) computer. The SISD computer can be
characterized as possessing no capability for exploitation
of algorithmic parallelism. The three remaining computer
system organizations within the Flynn taxonomy do possess
capabilities for exploiting algorithmic parallelism.

By allowing more than one data stream a
single-instruction multiple-data-stream (SIMD) computer
results, as shown in figure II.A.1(b). This organization
allows vectorization and is known as a vector or array
processor because each instruction operates on a data vector
during each instruction cycle, rather than on just one
operand. The model in figure II.A.1(b) shows N processors
each accepting as input its own data stream. It is
noteworthy that each of the N processors is not a standalone
serial machine (SISD computer) because the N processors take
the same instruction from an external control unit at each
time step.

If the SISD computer is extended to permit more than one
instruction stream, the multiple-instruction
single-data-stream (MISD) computer shown in figure II.A.1(c)
results. This computer system organization within Flynn's
taxonomy is present for completeness but has yet to be shown
to possess much utility. An example of such a machine would
be one built to generate tables of functions (such as
squares and square roots) of a stream of numbers. Each
processor would perform a different function on the same
data item at each time step.

Figure II.A.1: COMPUTER MODELS [STONE, 1980]

The fourth and final element of Flynn's taxonomy is one
which possesses parallelism in both the instruction and data
streams. This multiple-instruction multiple-data-stream
(MIMD) computer (shown in figure II.A.1(d)) is made up of N
complete SISD machines which are interconnected for
communication purposes. Such a parallel architecture is more
readily recognized as a multiprocessor in which as many as N
processors can be performing useful work at the same time.

Beyond Flynn's taxonomy are other approaches to
parallelism. The first, pipelining, is a strategy which
makes use of the fact that a processor, in executing an
instruction, actually performs a sequence of functions in
various functional units of the processor. Each function is
performed at a different stage along the pipeline. Figure
II.A.2 shows a processor with a simple pipeline design.
Rather than waiting for each instruction to be completely
executed before beginning the next instruction, the pipeline
processor begins execution of the next instruction as soon
as functional units at the beginning of the pipeline are
available. Thus, the pipeline is normally full, containing
more than one instruction in various stages of execution.
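The benefit of keeping the pipeline full can be illustrated with a small calculation. The sketch below is not taken from the cited literature; it assumes an idealized pipeline in which every stage takes one cycle and no stalls occur, and the function names are hypothetical.

```python
def serial_cycles(n_instructions, n_stages):
    """Cycles needed when each instruction finishes before the next begins."""
    return n_instructions * n_stages


def pipeline_cycles(n_instructions, n_stages):
    """Cycles needed on an ideal pipeline (one cycle per stage, no stalls).

    The first instruction takes n_stages cycles to drain through the
    pipeline; each later instruction completes one cycle after the
    previous one.
    """
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


if __name__ == "__main__":
    n, k = 100, 5
    print(serial_cycles(n, k), pipeline_cycles(n, k))
```

Under these assumptions the speedup approaches the number of stages as the instruction count grows; real pipelines fall short of this whenever stages are unequal or instructions must wait on one another.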

The final approach to parallelism to be presented is the
strategy of overlapping. In the traditional sense,
overlapping within a computer system occurs when the central
processing unit (CPU) is allowed to function concurrently
with input/output (I/O) operations. Such concurrency was
prevented in early computers because I/O operations required
data paths to memory which ran through CPU registers,
preventing CPU functions from occurring while performing
I/O. Overlapping can occur in other ways within a computer
but the example given is sufficient to convey the general
idea.

The techniques for exploiting algorithmic parallelism
that have been presented are not all mutually exclusive. For
example, the strategies of pipelining and overlapping can be
included in any of the four architectures. Furthermore,
other more complex machine organizations have been proposed.
One example is the multiple SIMD (MSIMD) machine which
consists of more than one control unit sharing a pool of
processors through a switching network [HWANG, 1979]. Such
hybrids will not be considered further.

Having described the major architectural approaches to
exploiting algorithmic parallelism it is appropriate to
characterize the problems for which each method is suitable
and to present some of the difficulties that still exist in
using each method. By suitable it is meant that the method
allows the processing of a problem in such a manner that
some speedup in execution time is achieved in comparison
with what the execution time would be for the problem run on
a serial machine.

The main implementation of the SIMD architecture, the
array processor, is suitable for computations which can be
described by vector instructions. Also, operands processed
simultaneously must be capable of being fetched
simultaneously from memory. Finally, processor
interconnections must support high speed data routing
between processors. If any of the above conditions are not
met, then the computation may execute in a predominantly
serial fashion within this SIMD computer. Because of these
required conditions, the array processor is generally
considered to be a specialized, rather than general
purpose, machine.

As previously mentioned, the MISD architecture exists
merely to complete the Flynn taxonomy and will not be
discussed further [STONE, 1980].

The dominant MIMD computer is the multiprocessor. The
multiprocessor is considered to be a general purpose
machine. Accordingly, many problems should be well-suited
to execution on such an architecture. Despite the fact that
such systems have been shown to work well in a number of
applications, especially those which consist of a number of
concurrently-processable subproblems with minimal data
sharing, numerous questions have yet to be answered. These
questions include how to best organize the parallel
computations (i.e. partition the problem) to optimize the
use of the cooperating processors, how to synchronize the
processors in the system, and how to best share the data
among system processors. Also, problems which possess an
iterative structure can run efficiently on an array
processor and avoid the overhead of synchronization and
scheduling required of the multiprocessor [STONE, 1980].

Although not considered architectures in the sense of
Flynn's taxonomy, both pipelining and overlapping (of which
there exist different types) are general purpose strategies
that can be applied to most problems. Like the
multiprocessor, these techniques also permit the
partitioning of a problem so that several operating hardware
pieces can function concurrently. Accordingly, pipelining
and overlapping are often considered to be forms of
multiprocessing. The difference lies in the fact that
pipelining and overlapping perform partitioning at different
hierarchical levels of a problem than does the
multiprocessing technique.

Armed with an understanding of the diverse architectural
approaches to parallelism exploitation that have been used
to date, it is logical to proceed with an alternative
approach, that of data flow. Before doing so, however, Petri
nets will be introduced. This is appropriate because the
concepts of data flow computation are a direct application
of Petri net theory.


B. PETRI NETS 

Petri net theory plays an important part in this
research endeavor for two reasons. First, as has been
mentioned, Petri net theory forms the basis for the concepts
used to describe and define data flow computation. Second,
Petri net theory is the basis for the Requester-Server
methodology that is used in this thesis research as a
computer performance prediction tool. Because of its
applicability, Petri net theory shall be presented herein,
in an informal manner, with emphasis placed on its use in
modelling parallel computation. Those desiring a more formal
and complete discussion of Petri nets are referred to
[PETERSON, 1977].

Petri nets may be thought of as formal, abstract models
of information flow. Their main use has been in the
modelling of systems of events in which some events may
occur concurrently but there exist constraints on the
frequency, precedence and concurrence of these events. A
Petri net graph models the static structure of a system. The
dynamic properties of a system can be represented by
"executing" the Petri net in response to the flow of
information (or occurrence of events) in the system.

The static graph of a Petri net is made up of two types
of nodes: circles (called places) which represent
conditions, and bars (called transitions) which represent
events. These nodes are connected by directed arcs running
from either places to transitions or transitions to places.
The source of a directed arc is the input, and the terminal
node is the output. The position of information in a net is
represented by markers called tokens.

The dynamic execution of a Petri net is controlled by 
the position and movement of the tokens. A token moves as a 
result of a transition firing. In order for a transition to 
fire, it must be enabled. A transition is enabled when all 
of the places which are inputs to a transition are marked 
with a token. Upon transition firing, a token is removed 
from each of the input places and a token is placed on each 
of the output places of the transition. Thus, in modelling
the dynamic behavior of a system, the occurrence of an event
is represented by the firing of the corresponding
transition.
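The enabling and firing rules just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the Requester-Server software; the class name, the place and transition names, and the assumption that a place appears at most once among a transition's inputs are all hypothetical.

```python
from collections import Counter


class PetriNet:
    """Minimal place/transition net; the marking maps place -> token count."""

    def __init__(self, transitions, marking):
        # transitions: name -> (list of input places, list of output places)
        self.transitions = transitions
        self.marking = Counter(marking)

    def enabled(self, name):
        inputs, _ = self.transitions[name]
        # A transition is enabled when every input place holds a token.
        return all(self.marking[p] >= 1 for p in inputs)

    def fire(self, name):
        if not self.enabled(name):
            raise ValueError(f"transition {name} is not enabled")
        inputs, outputs = self.transitions[name]
        for p in inputs:       # remove one token from each input place
            self.marking[p] -= 1
        for p in outputs:      # deposit one token on each output place
            self.marking[p] += 1


# Hypothetical two-input, one-output transition.
net = PetriNet({"t": (["p1", "p2"], ["p3"])}, {"p1": 1, "p2": 1})
net.fire("t")
print(dict(+net.marking))  # unary plus drops places with zero tokens
```

After the single firing, the only marked place is the output place, and the transition is no longer enabled because its input places are empty.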

Figures II.B.1 through II.B.4 show a Petri net at
progressive stages of execution. As can be observed, the
status of the execution at a given time can be described by
the distribution of the tokens in the net. This distribution
of tokens in a Petri net is called the net marking and
uniquely defines the state of the net for any given instant.

Petri nets are uninterpreted models. Thus, some
significance must be attached to token movement to indicate
the intent of the model. This is usually done by labelling
the nodes of a net to correspond in some way to the system
being modelled. However, it should be remembered that the
labelling of the nodes of a Petri net in no way affects its
execution. A second attribute of Petri nets is their ability
to model a system hierarchically. An entire net may be
replaced by a single node (place or transition) for
modelling at a greater level of abstraction or, conversely,
a single node may be replaced by a subnet to show greater
detail in the model.

Petri nets, as a formal graph model, are especially
useful in modelling the flow of information and control in
systems which can be characterized by asynchronous and
concurrent behavior. Figure II.B.5 shows the initial marking
of a Petri net model of such a system. Initially, transition
E1 is enabled because each of its input places, C1 and C2,
is marked with a token. Firing transition E1 removes one
token each from places C1 and C2, and puts a token into each
output place, C3 and C4. At this point in the net execution,
transition E3 is disabled because one of its input places,
C5, still has no token. Transition E2, however, is enabled,
and upon firing causes a token to be removed from place C3
and one to be deposited in place C5. As an aside, this
portion of the model could correspond to a system sequencing
constraint, that of event E3 having to wait until event E2
completes. Upon firing transition E3, places C6 and C7
become marked with tokens as places C4 and C5 lose a token
each. Transitions E4 and E5 are now enabled and can fire
simultaneously, the occurrence of which corresponds to
concurrent events in a modelled system. Doing so, that is,
firing transitions E4 and E5, brings the Petri net model
back to its original (initial) configuration.
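The firing sequence described for figure II.B.5 can be traced mechanically. In the sketch below the arcs for E1, E2 and E3 follow the text; the arcs for E4 and E5 (C6 to E4 to C1, and C7 to E5 to C2) are an assumption, chosen only because the text states that firing both returns the net to its initial marking.

```python
# Transition name -> (input places, output places).
# E1-E3 follow the text; the E4 and E5 arcs are assumed for illustration.
TRANSITIONS = {
    "E1": (("C1", "C2"), ("C3", "C4")),
    "E2": (("C3",), ("C5",)),
    "E3": (("C4", "C5"), ("C6", "C7")),
    "E4": (("C6",), ("C1",)),
    "E5": (("C7",), ("C2",)),
}


def enabled(marking, t):
    """True when every input place of t holds at least one token."""
    return all(marking.get(p, 0) > 0 for p in TRANSITIONS[t][0])


def fire(marking, t):
    """Return the marking that results from firing enabled transition t."""
    assert enabled(marking, t), f"{t} is not enabled"
    new = dict(marking)
    for p in TRANSITIONS[t][0]:
        new[p] -= 1
    for p in TRANSITIONS[t][1]:
        new[p] = new.get(p, 0) + 1
    return {p: n for p, n in new.items() if n > 0}  # keep marked places only


initial = {"C1": 1, "C2": 1}
m = fire(initial, "E1")
e3_after_e1 = enabled(m, "E3")      # C5 is still empty at this point
m = fire(m, "E2")
m = fire(m, "E3")
both_ready = enabled(m, "E4") and enabled(m, "E5")
m = fire(fire(m, "E4"), "E5")
print(e3_after_e1, both_ready, m == initial)
```

Because E4 and E5 share no input places, the order in which they are fired here does not affect the final marking, which is what allows them to be treated as concurrent events.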

One other situation that can be represented using Petri
nets is that of conflict. Figure II.B.6 shows a net model of
such a situation. Simply, transitions E1 and E2 are both
enabled. However, if either transition fires, the remaining
transition becomes disabled. In such a case, it is an
arbitrary decision as to which one fires. Because we would
like to be able to duplicate experiments and obtain the same
results, a scheme that is often used involves simply
assigning priorities to transitions which are subject to
conflict in a net. In this way reproducible results can be
ensured. If true nondeterminism is desired in such a model,
a scheme in which probabilities are associated with each
transition can effectively model nondeterminism in the
system under study.
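Both conflict resolution schemes can be sketched as a single selection function. The function name, its parameters, and the convention that a lower number means higher priority are assumptions made for this illustration; they are not taken from the Requester-Server implementation.

```python
import random


def choose_transition(enabled, priorities=None, weights=None, rng=None):
    """Pick one transition from a set of conflicting enabled transitions.

    With `priorities`, the choice is deterministic (lowest number wins,
    ties broken by name). Otherwise `weights` gives the relative
    probability of each transition firing.
    """
    names = sorted(enabled)
    if priorities is not None:
        return min(names, key=lambda t: priorities[t])
    rng = rng or random.Random()
    return rng.choices(names, weights=[weights[t] for t in names], k=1)[0]


# Reproducible: E1 always wins the conflict.
print(choose_transition({"E1", "E2"}, priorities={"E1": 1, "E2": 2}))

# Nondeterministic: E2 is chosen about three times as often as E1.
print(choose_transition({"E1", "E2"}, weights={"E1": 0.25, "E2": 0.75}))
```

Passing a seeded `random.Random` instance as `rng` makes even the probabilistic scheme repeatable from run to run, which may be useful when experiments must be duplicated.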

Thus, to properly model a system with Petri nets, every
sequence of events in the modelled system should be possible
in the Petri net and every sequence of events in the Petri
net should represent a possible sequence in the modelled
system.

This section has introduced Petri nets and demonstrated
their usefulness in formally modelling information and
control flow in systems characterized by asynchronous and
concurrent behavior. Readers interested in the use of Petri
nets for performance evaluation of such systems are referred
to [RAMAMOORTHY, 1980] and [RAMCHANDANI, 1974]. The
following section shall introduce readers to the concept of
data flow which, as mentioned previously, is a direct
application of Petri net theory.


C. CONCEPT OF DATA FLOW 

Data flow computing is a method of multiprocessing which
proposes to exploit inherent algorithmic parallelism at all
hierarchical levels within a program. Additional objectives
include effectively using the capabilities of LSI technology
and simplifying the programming task. The concept of
computation under data flow was derived by Dennis [DENNIS,
1974] (and a number of others working independently),
predominantly from Karp and Miller's [KARP, 1966] work on
computation graphs. This section begins by presenting the
data flow concept from the perspective of language, rather
than that of architecture. This approach is appropriate in
view of the fact that data flow computer systems are being
designed as hardware interpreters for a base language that
is fundamentally different from conventional languages. A
hardware description of the Dennis-Misunas data flow
architecture design completes this section.

In a data flow computer, an instruction is executed as
soon as its operands become available. No notion of separate
control flow exists because the data dependencies define the
flow of control in a data flow program. In fact, data flow
computers have no need for a program location counter.

This contrasts with the traditional "von Neumann"
computer architecture model which uses a global memory whose
state is altered by the sequential execution of
instructions. Such a model is limited by a "bottleneck"
between the computer control unit and the global memory
[BACKUS, 1978]. The alterable global memory allows
conventional languages to have side-effects, a common
example of which is the ability of a procedure to modify
variables in the calling program. Such side-effects are
prohibited under the data flow concept. Furthermore, in data
flow, no variables exist, nor are there any scope or
substitution rules. In fact, the data flow concept prohibits
the modification of anything that has a value. Rather, data
flow computing takes inputs (operands) and generates outputs
(results) that have not previously been defined. Thus,
instructions in data flow are pure functions. This is
necessary so that instruction execution can be based solely
on the availability of data (operands). Thus the data
dependencies must be equivalent to, and in fact define, the
sequencing constraints in a program. Also, to exploit
parallelism at all levels, it must be possible to derive
these data dependencies from the high level language program
instructions [ACKERMAN, 1979].

A language which allows processing by means of operators
applied to values is called an applicative language. VAL
(Value-oriented Algorithmic Language) is a high level data
flow applicative language under development at MIT [ACKERMAN
and DENNIS, 1979]. It prevents any side-effects by requiring
programmers to write expressions and functions; statements
and subroutines are not allowed in the language. Because of
this constraint, most concurrency is apparent in a high
level language program written in VAL. For the purposes of
this research, no further understanding of the high level
language of data flow is required. Information about high
level language alternatives is available in [McGRAW, 1980],
[ACKERMAN and DENNIS, 1979], and [ACKERMAN, 1979].

At what would correspond to the assembly language level,
a data flow computation can be represented as a graph. The
nodes of the graph correspond to operators and the arcs
represent data paths. An arc into a node represents an input
operand path; an arc leaving a node corresponds to a result
path. Data flow graph execution occurs as operands become
available at each node. When the input arcs of a node each
have a value on them, the node can execute by removing those
values, computing the operation, and placing the results on
the output arcs (see Figure II.C.1). The example data flow
graph in figure II.C.2 computes the mean and standard
deviation of its three input parameters.

Figure II.C.1: DATA FLOW NODE EXECUTION
(panels: NODE ENABLED, NODE AFTER EXECUTING)

function Stats (X,Y,Z: real returns real, real)
  let
    Mean : real := (X + Y + Z) / 3;
    SD : real := SQRT( (X² + Y² + Z²) / 3 - Mean² );
  in
    Mean , SD
  endlet
endfun

Figure II.C.2: A SIMPLE STATISTICS
FUNCTION AND ITS DATA FLOW GRAPH
[McGRAW, 1980]

Such a graph notation is useful in illustrating the
various levels of parallelism in a program. For example, a
graph node may represent a simple operator such as addition,
or the entire statistics function of figure II.C.2. Thus the
data flow graph notation can represent parallelism existing
at the operator, function and even computation level. The
graphs execute asynchronously, nodes firing when data inputs
are available. Thus no synchronization problem exists with
regard to accessing shared data. Each data flow path can be
marked with a value by only one operator node. Once a value
is on a path, no operator can modify that value. The value
can only be read when used as an input to another node
[McGRAW, 1980].
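The data-driven firing of graph nodes can be sketched for the mean and standard deviation computation. The decomposition into the six nodes below, the arc names, and the `run` function are assumptions made for illustration; they are not the graph of figure II.C.2. As a simplification, values are left on their arcs after being read (so that one arc can feed several nodes) rather than being removed as in the node execution rule.

```python
import math

# Each node: (operator, names of input arcs, name of output arc).
# This particular decomposition is assumed for illustration only.
GRAPH = [
    (lambda x, y, z: x + y + z, ("X", "Y", "Z"), "sum"),
    (lambda s: s / 3, ("sum",), "Mean"),
    (lambda x, y, z: x * x + y * y + z * z, ("X", "Y", "Z"), "sumsq"),
    (lambda q: q / 3, ("sumsq",), "meansq"),
    (lambda m: m * m, ("Mean",), "Mean2"),
    (lambda a, b: math.sqrt(a - b), ("meansq", "Mean2"), "SD"),
]


def run(graph, inputs):
    """Fire nodes whenever all of their input arcs carry a value."""
    values = dict(inputs)      # arc name -> value; each arc is written once
    pending = list(graph)
    while pending:
        ready = [n for n in pending if all(a in values for a in n[1])]
        if not ready:
            raise RuntimeError("no node is enabled; graph cannot complete")
        for fn, ins, out in ready:     # these nodes could fire concurrently
            assert out not in values   # only one node may mark a given arc
            values[out] = fn(*(values[a] for a in ins))
        pending = [n for n in pending if n not in ready]
    return values


result = run(GRAPH, {"X": 1.0, "Y": 2.0, "Z": 3.0})
print(result["Mean"], result["SD"])
```

Each pass of the loop collects every node that is enabled at that moment; on the first pass both the sum and the sum-of-squares nodes are ready, which is the operator-level parallelism the graph notation exposes.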

Again, the data flow graph notation merely allows a
logical representation of a program at a level corresponding
to conventional assembly language. This logical
representation shall now be extended to permit the reader to
understand the basic data flow hardware instruction
execution mechanism. A simple example computation that shall
be used to facilitate reader understanding is the following:


     z = (x + y) * (x - y)
The graph representation of this computation is shown in
figure II.C.3.

Figure II.C.3: A SIMPLE DATA FLOW PROGRAM

Figure II.C.4: AN ACTIVITY TEMPLATE FOR
THE ADDITION OPERATOR

Figure II.C.5: PROGRAM GRAPH USING ACTIVITY
TEMPLATES FOR THE DATA FLOW PROGRAM GRAPH OF
FIGURE II.C.3 [DENNIS, 1980]

In the extended graph representation scheme, a data flow
program exists as a collection of activity templates, each
template corresponding to a node in the data flow program
graph. For example, figure II.C.4 shows an activity template
for the addition operator. There are four fields in the
activity template. The first field denotes the operation
code which specifies the operation to be performed. The
second and third fields are receivers, which are locations
waiting to receive operand values. The fourth field is a
destination field which specifies where the result of the
operation on the operands is to go. There can be multiple
destination fields. Figure II.C.5 shows the program graph
representation of figure II.C.3, using activity templates.

Activity templates have been developed which control the
routing of data for such program structures as conditionals
and iterations. These templates are mentioned to point out
the fact that graph nodes can represent not only simple
operators but can also represent more elegant and necessary
constructs.

Some definitions which are necessary to the
understanding of the data flow instruction execution
mechanism follow. First, a data flow program instruction is
the fixed portion of an activity template and is made up of
the opcode and the destinations.

instruction:


<opcode, destinations> 


Each destination field provides the address of some activity 
template and an input (or offset) denoting which receiver of 


the template is the target. 


destination: 


<address, input> 


Data flow program execution occurs as follows. The
fields of a template which has been activated (by the
arrival of an operand value at each receiver) form an
operation packet:


<opcode, operands, destinations> 


When the operation packet has been operated upon, a result
packet of the form

result packet:


<value, destination> 


is generated for each destination field of the original
activity template. Result packet generation triggers the
placement of the value in the receiver designated by its
destination field. Thus, at a logical level, data flow
program execution occurs as a consequence of operation
packet and result packet movements through a machine
described in detail below.

The basic data flow instruction execution mechanism is
shown in figure II.C.6. The data flow program, consisting of
a collection of activity templates, is held in the activity
store (see figure II.C.6). Each activity template is
uniquely addressable within the activity store. When an
instruction is ready to be executed (i.e. the template is
enabled), this address is entered in the instruction queue
unit (established as a FIFO buffer).

The fetch unit is then responsible for: removing, one at
a time, instruction addresses from the instruction queue,
fetching the corresponding activity template, forming an
operation packet based on the field values in the template,
and submitting the operation packet to an operation unit for
processing. The operation unit processes the operation
packet by: performing the operation specified by the opcode
on the operands, forming result packets (one for each
destination field of the operation packet), and transmitting
the result packets to the update unit. The update unit fills
in the receivers of activity templates (designated by the
destination fields in the result packets) with the
appropriate values. The update unit is also responsible for
checking the target template to see if it has all receivers
filled, thus enabling the template. If so, the address of
the enabled template is added at the end of the instruction
queue by the update unit.
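The cooperation of the activity store, instruction queue, fetch unit, operation unit and update unit can be sketched as follows. This is a sequential illustration of the logical mechanism only, not of the concurrent hardware. The example program (a + b) * (a - b), the opcode names, the "out" destination address, and the clearing of receivers when a template is fetched are all assumptions made for the sketch.

```python
from collections import deque
import operator

OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}


def make_template(opcode, destinations):
    # Fixed part (opcode, destinations) plus two initially empty receivers.
    return {"opcode": opcode, "receivers": [None, None],
            "destinations": destinations}


# Hypothetical program: (a + b) * (a - b). A destination is <address, input>.
activity_store = {
    0: make_template("ADD", [(2, 0)]),
    1: make_template("SUB", [(2, 1)]),
    2: make_template("MUL", [("out", 0)]),
}
instruction_queue = deque()    # FIFO of enabled template addresses
outputs = {}                   # values sent to the assumed "out" address


def update(value, destination):
    """Update unit: fill a receiver; enqueue the template if now enabled."""
    address, slot = destination
    if address == "out":
        outputs[slot] = value
        return
    template = activity_store[address]
    template["receivers"][slot] = value
    if all(r is not None for r in template["receivers"]):
        instruction_queue.append(address)


def fetch():
    """Fetch unit: form an operation packet from the next queued address."""
    template = activity_store[instruction_queue.popleft()]
    operands, template["receivers"] = template["receivers"], [None, None]
    return template["opcode"], operands, template["destinations"]


def operate(packet):
    """Operation unit: apply the opcode; one result packet per destination."""
    opcode, operands, destinations = packet
    value = OPS[opcode](*operands)
    return [(value, d) for d in destinations]


# The initial operands a = 5 and b = 3 arrive as result packets.
for value, dest in [(5, (0, 0)), (3, (0, 1)), (5, (1, 0)), (3, (1, 1))]:
    update(value, dest)
while instruction_queue:
    for value, dest in operate(fetch()):
        update(value, dest)
print(outputs[0])
```

In this sketch the loop drains one packet at a time; in the mechanism described in the text the units work on different packets simultaneously, so the ADD and SUB templates, which are enabled together, need not wait for one another.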
At this point it is appropriate to discuss how and where
program parallelism can be exploited by this hardware.


"...once the fetch unit has sent an operation packet off 
to the operation unit, it may immediately read another 
entry from the instruction queue without waiting for the
instruction previously fetched to be completely processed.
Thus a continuous stream of operation packets may flow
from the fetch unit to the operation unit so long as the
instruction queue is not empty.


"This mechanism is aptly called a circular pipeline- 
activity controlled by the flow of information packets
traverses the ring of units leftwise. A number of packets
may be flowing simultaneously in different parts of the
ring on behalf of different instructions in concurrent
execution. Thus the ring operates as a pipeline system
with all of its units actively processing packets at once.
The degree of concurrency possible is limited by the
number of units on the ring and the degree of pipelining
within each unit. Additional concurrency may be exploited
by splitting any unit in the ring into several units which
can be allocated to concurrent activities." [DENNIS,
NOV 1980]


The Dennis-Misunas data flow architecture for
implementing the described instruction execution mechanism
is called the cell block architecture and is illustrated in
figure II.C.7.

"The heart of this architecture is a large set of
instruction cells, each of which holds one activity
template of a data flow program. Result packets arrive at
instruction cells from the distribution network. Each
instruction cell sends an operation packet to the
arbitration network when all operands and signals have
been received. The function of the operation section is to
execute instructions and to forward result packets to
target instructions by way of the distribution network."
[DENNIS, NOV 1980]

Figure II.C.8 reflects a practical form of the cell
block architecture which makes use of LSI technology and
reduces the number of devices and interconnections. This
practical form is obtainable by grouping the instruction
cells of figure II.C.7 into blocks, each of which is a
single device. In this organization, several cell blocks are
serviced by a group of multifunction processing elements.
The arbitration network channels operation packets from cell
blocks to processing elements. Rather than employing a set
of processing elements each capable of a different function,
which is one design option, use of one multipurpose
processing element type is the favored approach. Such an
approach precludes the need for the arbitration network to
route operation packets according to opcode. Instead, it
simply has to forward operation packets to any available
processing element. It is this design which forms the basis
for the system model used in this research effort.

How does the basic mechanism relate to the cell block
architecture? Figure II.C.9 shows a cell block
implementation. It differs from the basic mechanism in two
ways. First, the cell block has no processing element(s)
(operation unit(s)). Second, result packets targeted for
activity templates held in the same cell block must traverse
the distribution network before being handled by the update
unit [DENNIS, NOV 1980]. This is the Dennis-Misunas data
flow architecture design. Other designs do exist; for
examples, see [GOSTELOW, 1980] and [WATSON, 1979].

Figure II.C.8: PRACTICAL FORM OF THE CELL BLOCK ARCHITECTURE


D. COMPUTER PERFORMANCE PREDICTION 

Computer performance prediction is an evaluation process
which proposes to estimate the performance of a system not
yet in existence (i.e. in some state of design).
"Performance" simply means how well a system works. This in
turn connotes the concept of value. So, the purpose of
estimating the performance of a system under design is to
determine that system’s expected value.

In order to quantify how well a system works or shall
work, performance metrics called indices are used. Typical
indices and their definitions are:

THROUGHPUT RATE      - The volume of information processed
                       by a system in one unit of time.

HARDWARE UTILIZATION - The ratio between the time the
                       hardware is used during an interval
                       of time, and the duration of that
                       interval of time.

RESPONSE TIME        - The elapsed time between the
                       submission of a program job to a
                       system and completion of the
                       corresponding job output.
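
The three definitions translate directly into ratios and
differences. The short Python sketch below restates them; the
function names and the sample numbers are hypothetical and
serve only to make the definitions concrete.

```python
def throughput_rate(volume_processed, elapsed_time):
    """Volume of information processed per unit of time."""
    return volume_processed / elapsed_time


def hardware_utilization(busy_time, interval):
    """Fraction of an interval during which the hardware is in use."""
    return busy_time / interval


def response_time(submitted_at, completed_at):
    """Elapsed time from job submission to completion of its output."""
    return completed_at - submitted_at


# Hypothetical observation: 120 instructions in 60 time units,
# with the hardware busy for 45 of those units.
rate = throughput_rate(120, 60)
utilization = hardware_utilization(45, 60)
elapsed = response_time(submitted_at=10, completed_at=70)
print(rate, utilization, elapsed)  # 2.0 0.75 60
```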

Computer performance prediction can be achieved via
several different techniques. Each technique has limitations
and advantages. The technique utilized in this thesis is
that of simulation. The simulation technique involves the
representation, by a model, of certain aspects of the
behavior of a system in the time domain. Observing these
aspects of the behavior in time of the system’s model, under
inputs generated by a model of the system’s inputs, produces
results useful in the evaluation of the modelled system
[FERRARI, 1978]. For the purposes of this research, the
aspects of behavior that are of interest are the performance
indices previously defined.

Of significant importance to any simulation effort are
the issues of validation and parameter estimation.
Conceptually, validation attempts to establish some degree
of confidence that the simulation shall produce results
which closely correspond with the performance of the
system under scrutiny. Parameter estimation provides the
simulation effort with hopefully credible parameter values
needed to perform a simulation having relevant results.
These issues shall be addressed in section IV.A:
Experimental Design.

The last section of the review of the literature
applicable to this research endeavor presents the
Requester-Server methodology. The Requester-Server
methodology is the tool used to perform the simulation
which generates the results on which the prediction of data
flow performance is based.

Readers desiring a more thorough presentation of the
subject of computer performance prediction are referred to
[FERRARI, 1978], [COX, 1978], [ALLEN, 1980], [HAMMING,
1975], [SPRAGINS, 1980], [BUZEN, 1980], and [SAUER, 1980].


E. REQUESTER-SERVER METHODOLOGY

The Requester-Server (R-S) methodology was designed and
initially implemented by L. A. Cox, Jr. [COX, 1978].
Subsequently, the Requester-Server software was modified by
D. M. Stowers [STOWERS, 1979] to run on the PDP-11/50
minicomputer at NPS. This section summarizes those portions
of [COX, 1978] and [STOWERS, 1979] which are applicable to
and necessary for the understanding of this research.

The R-S methodology is capable of predicting the
performance of computer systems characterized by
asynchronous, concurrent behavior. The methodology can
predict performance at both the computer system and computer
job levels. The R-S methodology allows the user to
separately specify the hardware configuration(s) to be
evaluated, the software (programs) to be used in evaluating
the hardware configuration(s), and the mechanism or policy
for allocating hardware resources to program requests for
service. The methodology makes provision for variable levels
of detail (in a hierarchical sense) in both the hardware and
software. Finally, the R-S methodology is capable of
simulating concurrency in both the hardware and software.
Thus, for a given hardware configuration, the control
structure mandated by the software can be mapped onto the
hardware and system performance analyzed and predicted.

The simulation process is begun by representing the
software (programs) and hardware as two separate Petri net
graphs. In the Petri net graph of the software, each arc can
be thought of as having an associated propagation delay, the
extent of which is dependent upon the hardware configuration
used to execute the program. If these delays are definable
by their correlation to the Petri net model of the hardware,
then performance values for the indices of section II.D
(Computer Performance Prediction) can be obtained by
executing the Petri net model of the software on the Petri
net hardware configuration(s). The R-S "tool" serves as the
interface between the Petri net model of system software and
the Petri net model of system hardware. This interface
permits the hardware and software Petri net graphs to be
constructed separately. This is important because the
control structure and sequencing constraints of both
hardware and software can be maintained separately. This
permits a direct and meaningful representation of both the
system software and hardware being modelled.

The source file which serves as the input to the R-S
program is organized into three sections. The software
section of the input file consists of a description of the
Petri net graph representing the software program(s) to be
executed. This net graph description is formulated in terms
of the functions and constraints of the services required of
the hardware. The hardware section of the input file is made
up of a description of the computer system components and
their interconnections. This description can be
(hierarchically) at a bit level or major component level,
depending on the system aspects under scrutiny. The Petri
net graph upon which the hardware description is based is
constructed in terms of its operation in time. The last
section of the input file, called the dynamic section,
provides the user of the R-S "tool" a place to denote system
initial conditions by defining the hardware and software
nets’ token markings at the beginning of a "run". As may be
recalled from section II.B (Petri Nets), both the software
and hardware sections merely define static Petri net
structures. Performance prediction follows from the
attachment of significance to the structures and
restrictions on token movement within these structures.

The dynamic nature of Petri nets is exploited by the
R-S methodology as follows. The software net representation
makes a series of requests for the services of the hardware
net representation. Repeatedly, the R-S process maps these
requests for service onto the hardware net representation.
At each "invocation" the R-S process runs the hardware net
to provide the service requested by the software net. Upon
completion of each of the service requests, the R-S process
"runs" the software net representation until the hardware is
again needed. This cycle repeats itself until the software
net representation has been completely "run" and its
terminal state reached.

Events in the hardware net graph correspond to
operations in time. A collection of events is used to
represent each functional unit. Token movement through the
hardware net graph corresponds to the flow of data and
control through the modelled hardware system. A simple
hardware net description is provided in figure II.E.1.
Events in the software net graph correspond to requests for
service. As an example, an event could equate to a request
for a floating point multiplication. The flow of tokens in
the software net graph equates to the logical flow of the
algorithm, constrained by its implicit data dependencies or
sequencing constraints. A simple software net description is
provided in figure II.E.2.

Together, the software and hardware net graphs can be
executed in such a way as to simulate the operation of the
computer system for the given software workload. The
interaction of the two net graphs is orchestrated by the R-S
token arbiter. Network simulation begins with the marking of
the BEGIN node of the software net graph. This net graph
is then executed as would be any Petri net graph. The
arrival of a token at any place in the software net graph
indicates a request for service, at which time the R-S token
arbiter takes control. (The type of service requested is
denoted by the type of the place and is defined in the
software net description.) The R-S token arbiter removes the
token from the software net and then permits the software
net graph to continue executing until no further moves are
possible. The R-S token arbiter then initializes the
hardware functional unit (net graph denoted by the type of
service requested) by marking it with tokens. The hardware
net graph is then executed one step. Tokens reaching events
corresponding to service completion are removed, and the
token of the software net which originally caused the
request for service is replaced, by the R-S token arbiter.
Repeating this sequence of actions results in the execution
of the software net graph by the hardware net graph. A
sample input file dynamic section and the results obtained
from executing the software and hardware net graph
descriptions of figures II.E.1 and II.E.2 are presented in
figure II.E.3. Those readers interested in the
Requester-Server methodology are referred to [COX, 1978] and
[STOWERS, 1979] for a more in-depth discussion of its
capabilities and usage.
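
The request/serve cycle can be illustrated with a much
reduced Python sketch. It is not the R-S program of
[COX, 1978]: the software net is collapsed to a table of
events with their service types and predecessors, the
hardware net is collapsed to a table of service durations
plus a count of identical functional units, and the names
simulate, software, duration and units are all hypothetical.
Ties between simultaneous requests are broken alphabetically,
which is an arbitrary assumption.

```python
def simulate(software, duration, units):
    """Return (total time steps, completion step of each event).

    software: event -> (service type, list of predecessor events)
    duration: service type -> number of hardware steps required
    units:    number of identical hardware functional units
    """
    done = {}                  # event -> step at which service completed
    in_service = {}            # event -> remaining hardware steps
    pending = set(software)    # events that have not yet requested service
    clock = 0
    while pending or in_service:
        # Software net runs until no further moves: collect enabled requests.
        enabled = sorted(e for e in pending
                         if all(p in done for p in software[e][1]))
        # Arbiter marks a free hardware unit for each request it can place.
        for event in enabled:
            if len(in_service) >= units:
                break
            in_service[event] = duration[software[event][0]]
            pending.remove(event)
        if not in_service:
            raise ValueError("no enabled event: dependency cycle")
        # Hardware net executes one step.
        clock += 1
        for event in list(in_service):
            in_service[event] -= 1
            if in_service[event] == 0:     # service complete: token returned
                del in_service[event]
                done[event] = clock
    return clock, done


# Hypothetical workload: two multiplications feeding one addition.
software = {
    "m1": ("mul", []),
    "m2": ("mul", []),
    "a1": ("add", ["m1", "m2"]),
}
duration = {"mul": 2, "add": 1}
steps, finished = simulate(software, duration, units=2)
print(steps, finished["a1"])  # 3 3
```

With a single functional unit the same workload needs five
steps, since the two multiplications can no longer overlap.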

This completes the necessary review of the literature
required to understand the research that follows. The
fundamental concepts of the various approaches to
parallelism, Petri nets, the data flow architecture,
computer performance prediction, and the Requester-Server
methodology have been reviewed. The next section presents
the two-part hypothesis which this research addresses and
tests.


Because there exist several data flow architecture
proposals, it is desirable to have a tool with which to
predict the performance of the diverse designs for
comparison purposes. The first part of this research’s
hypothesis was that the Petri net-based Requester-Server
(R-S) methodology is such a tool, capable of predicting the
performance of data flow architectures in an efficient,
accurate manner. In effect, the R-S tool was to be tested.

The second part of this research’s hypothesis was
concerned with the Dennis-Misunas data flow architecture
design. This design was chosen for two reasons. First, there
existed adequate information in the literature about this
design on which to base an accurate model for simulation
purposes. Second, the Dennis-Misunas design of the basic
instruction execution mechanism is essentially the same as
several other schemes in various stages of implementation
[DENNIS, 1979]. The hypothetical challenge to this design
was that the goal of achieving higher speed computation is
not attainable unless a high and "intelligent" degree of
multiprogramming is realized, as shall be explained next.

Obviously, high speed computation shall require a high
hardware utilization. By this it is meant that most of the
processing elements (PE’s) shall have to be performing
useful work most of the time. Such a high hardware
utilization is attainable when either of two situations
occurs. First, a high hardware utilization will result when
a process possessing a large amount of inherent parallelism
is being run (by itself) on the machine. In this case, a
program’s execution time is dependent upon its amount of
inherent parallelism and the number of PE’s in the machine.
Second, a high hardware utilization is attainable when a
multiprogramming environment (in which several processes are
permitted to simultaneously run on the machine) is
instituted. In such a multiprogramming environment, an
individual process shall be competing for hardware resources
(PE’s). Thus, that process’ execution time may be lengthy
regardless of its amount of inherent parallelism. This is so
because that process may have the use of only a small
portion of the machine’s resources (PE’s) at any point in
time. Put another way, if at any time a process has N
instructions available for execution, but there are fewer
than N PE’s available for executing those instructions in a
parallel fashion, then the process’ execution time shall be
lengthened over what it could be if it had sufficient PE’s
available.

In such a situation, a scheme may be needed to implement
a policy which achieves two objectives:

1. maintaining high hardware utilization and
2. providing an acceptable average response time for
   a user requiring a given amount of processing.

"Acceptable average response time" is construed to mean that
the actual response time of any particular program which
requires a given amount of processing shall not be
lengthened considerably over what it would be if the program
were executed by itself on the data flow machine. Thus, it
shall be desirable to minimize the effect of system load on
an individual program’s execution time. That the second
objective should be met even at the expense of the first
objective is a strong point made by [KLEINROCK, 1976]. By
merely mapping processes onto the data flow machine as they
arrive, it is expected that objective #1 shall be achieved
but at the expense of objective #2. This situation was
expected to be demonstrated by this research.
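
The tension between the two objectives can be illustrated
with a small Python sketch. It is not the simulation used in
this research. It assumes that each program is given as a
list of per-step counts of executable instructions, that all
instructions of one step must finish before the next step of
that program begins, and that PE’s are handed to programs in
order of arrival. The name run_mix and the two sample
programs are hypothetical.

```python
def run_mix(programs, pes):
    """Return (finish step of each program, hardware utilization).

    programs: name -> list of positive per-step instruction counts,
              listed in order of arrival
    pes:      number of processing elements
    """
    remaining = {name: list(counts) for name, counts in programs.items()}
    finish, clock, busy = {}, 0, 0
    while remaining:
        clock += 1
        free = pes
        for name in list(remaining):        # earlier arrivals are served first
            stages = remaining[name]
            used = min(free, stages[0])
            stages[0] -= used
            free -= used
            if stages[0] == 0:              # current step of this program is done
                stages.pop(0)
                if not stages:
                    finish[name] = clock
                    del remaining[name]
        busy += pes - free
    return finish, busy / (pes * clock)


a = [4, 2, 1]   # hypothetical program A
b = [2, 2, 2]   # hypothetical program B
alone_a = run_mix({"A": a}, 4)
alone_b = run_mix({"B": b}, 4)
mixed = run_mix({"A": a, "B": b}, 4)
print(alone_b)  # ({'B': 3}, 0.5)
print(mixed)    # ({'A': 3, 'B': 4}, 0.8125)
```

In this example the mix raises utilization from 0.5 (program
B alone) to about 0.81, while B’s response time grows from
three steps to four: objective #1 improves at the expense of
objective #2.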

The purpose of this section has been to frame the
research area by presenting the issues which give rise to
the hypothesis. The following section presents the method
used to test the hypothesis, and includes a discussion of
the assumptions made to facilitate the simulation
experiments where undecided design issues remain.


IV. METHOD


A. EXPERIMENTAL DESIGN 

The experiment to allow prediction of data flow computer
performance involved executing sets of Petri net models of
data flow programs on Petri net models of data flow
hardware. The Requester-Server (R-S) program tool monitored
the data flow (model) programs’ "execution" and provided
data which permitted the determination of the performance
indices: response time and hardware utilization. It is
important to realize that the results of this research
predict the performance of a model of a data flow computer,
not that of an operating data flow machine itself. (Model
validation and parameter estimation issues are addressed in
section IV.B: Data Flow Hardware Definition.)

The reader who is familiar with analytic modelling
employing queueing theory may ask why that technique, rather
than the simulation technique, is not used to predict the
performance of the data flow design. The answer is that the
analytic approach unnecessarily constrains the prediction by
requiring assumptions to be made about the software.
Specifically, the Petri net models of software programs are
discretely defined with regard to the amount of inherent
parallelism available for exploitation at each time step in
program execution. To model analytically, the variability of
inherent parallelism available for exploitation must be
described by probability distributions which hide the
definable nature of the programs at discrete time steps.

For this experiment, the data flow architecture was to
be modelled with several different quantities of processing
elements (PE’s). The sample data flow program models were to
be characterized by varying but definable amounts of
inherent parallelism available for exploitation. Each
(model) program was to be separately run on each (model)
hardware configuration. (Hereafter, the word "model" shall
be omitted but assumed in referring to the program and
hardware models used in this experiment.) Data was to be
obtained to permit determination of the performance indices
(response time and hardware utilization), for each run, from
the monitor function of the R-S tool. After running each
program separately, arbitrary program mixes were to be run
on each hardware configuration and the same performance
indices again determined. Finally, hand-optimized program
mixes were to be run on each hardware configuration and the
same performance indices determined once again. By
evaluating the results, the hypothesis was expected to be
either supported or refuted.

The independent variable for this experiment was defined
to be the quantity of PE’s available to the hardware model.
Because PE’s are but one resource demanded by a process in
execution, other independent variable choices could have
included other resources such as: the quantity of cell
blocks available, the type of distribution or arbitration
network employed, and/or the type of PE’s (multipurpose or
sets of single-purpose functional units) utilized. Expanding
the number of independent variables increases significantly
the complexity of evaluating the results. How these issues
were resolved is explained in section IV.B.

The dependent variables for this experiment were the
parameters response time and hardware utilization. The
results of the experiment were expected to provide data
which could be plotted on graphs. Curves plotting the
execution time of each data flow program against the number
of PE’s would constitute one such graph. Others, and their
significance, are presented in section V: Results and
Discussion.
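
The kind of curve described here can be previewed with a
short Python sketch. The sketch is a simplification under
stated assumptions, not the R-S simulation: a program is
reduced to a hypothetical list of per-step counts of
executable instructions, network and memory delays are
ignored, and all instructions of one step are assumed to
finish before the next step begins. The names profile,
execution_steps and utilization are hypothetical.

```python
from math import ceil


def execution_steps(profile, pes):
    """Time steps needed when each step's instructions share `pes` PE's."""
    return sum(ceil(count / pes) for count in profile)


def utilization(profile, pes):
    """Fraction of PE time steps spent executing instructions."""
    return sum(profile) / (pes * execution_steps(profile, pes))


profile = [6, 4, 3, 1]            # hypothetical program, 14 instructions
pe_counts = [1, 2, 4, 8, 16]
steps = [execution_steps(profile, p) for p in pe_counts]
usage = [utilization(profile, p) for p in pe_counts]
for p, s, u in zip(pe_counts, steps, usage):
    print(p, s, u)
```

For this profile the execution time stops improving beyond
eight PE’s (four steps in either case), while utilization
keeps falling as PE’s are added.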


B. DATA FLOW HARDWARE DEFINITION 

The Petri net models of the data flow hardware
configurations were quantified in terms of their operation
in time. Such quantification required assigning time
duration values to each portion of the cell block
architecture model in such a way as to closely model the
hardware. Doing so required several assumptions to be made.
Those assumptions shall be addressed individually so as to
help substantiate the credibility of the resultant hardware
models.

To begin, the processing elements (PE’s) were assumed to
be multifunctional, each capable of executing any
instruction routed to it in one "standard" instruction
execution time unit. Allowing the PE’s to be multifunctional
and characterized by a singular execution time simplifies
the modelling process.

There were at least two other possibilities that could
be accommodated by expansion of the Petri net models. First,
each multifunction PE could be replaced by a set of
single-purpose PE’s, each single-purpose PE defined in terms
of its particular instruction execution time, and capable of
executing concurrently with other PE’s of the set. Second,
each PE could be replaced by a subnet in which only one
instruction could be executed in any given time step, but
the model would define the execution time as a function of
the instruction type.

The first alternate approach implies a more complex
arbitration network with a conceivably longer routing time.
The second alternative would require additional net
complexity. (However, this approach would be a good
possibility for subsequent research.) Because the actual
implementation configuration has not been finalized,
modelling the PE’s as multifunctional and characterized by a
singular execution time was a reasonable path to follow.

The distribution network design also has not been
finalized. For ease of modelling, a crossbar
switch design capable of supporting simultaneous transfers
of result packets to cell blocks was chosen to be modelled.
This choice permitted a standard routing time to be
characterized by the model. Other network designs,
especially packet routing networks, may be preferred to the
crossbar switch for the ultimate machine because of their
lower cost and comparable performance in a data flow
architecture [DENNIS, 1979].

The choice to model the PE’s as multifunction units
precluded the need for anything but a simple arbitration
network. Such a network would merely have to route operation
packets to any available PE. Accordingly, in the model, a
standard routing time for this network was characterized.

With regard to the cell blocks, the assumption was made
that sufficient cell blocks were available to hold all
portions of all processes being run on the machine at each
and every instant. Thus, there is no notion of paging
portions of processes into and out of memory (the activity
store in the case of the data flow architecture). This
assumption carries with it the assumption that all program
compilation (resulting in extended data flow graph-like
representations) is complete before beginning program
execution. Other compilation strategies are under
consideration, such as requiring the user to interact with
the system to achieve a high degree of parallelism
exploitation [McGRAW, 1980].

As has been described, other hardware choices
(representing independent variables in an experiment) can be
made and easily implemented by simply defining appropriate
subnets which, in a time-wise fashion, characterize the
portions of the hardware under scrutiny. The approach taken
in this research permitted the hardware timing
characteristics to be a function of simply the number of
PE’s. Figure IV.B.1 is the Petri net representation (of the
cell block architecture hardware) utilized in this research.
For the purposes of this experiment and in the configuration
described, each PE was assumed to be driven at the rate of
two million floating point operations per second (FLOPs), a
rate claimed to be reasonable by [DENNIS, 1980]. This figure
represents an instruction execution time of 500 nanoseconds
(nsec). (This is represented in the hardware model by
scaling each event/transition pair to equal 100 nsec.)
Associating timing characteristics with each component in
the data flow architecture design results in a similar
figure, as shown in figure IV.B.2.

PE (instruction execution)                          50 nsec
CELL BLOCK (memory fetch assuming MOS technology)  250 nsec
DISTRIBUTION NETWORK (assuming crossbar switch)    250 nsec
ARBITRATION NETWORK (assuming negligible)            -

TOTAL:                                             550 nsec
[WEITZMAN, 1980]

FIGURE IV.B.2: TIMING CHARACTERISTICS OF DATA FLOW
ARCHITECTURE COMPONENTS

For the purposes of this research, the quantity of PE’s in
the hardware was varied from one to sixteen in powers of
two. This resulted in data generated for the following
quantities of modelled PE’s: 1, 2, 4, 8, and 16.

This summarizes the assumptions utilized in developing
the model presented here. The following section describes
the Petri net definition of the data flow software, that is,
the program models which were executed on the hardware
models.
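
The arithmetic behind the timing assumptions of this section
can be checked with a few lines of Python. The sketch reads
the component entries of figure IV.B.2 as 50, 250 and 250
nsec with a negligible arbitration delay; the names used are
hypothetical.

```python
def execution_time_nsec(operations_per_second):
    """Time for one operation, in nanoseconds, at the given rate."""
    return 1e9 / operations_per_second


# Component delays in nsec, as read from figure IV.B.2.
component_nsec = {
    "PE (instruction execution)": 50,
    "cell block (memory fetch)": 250,
    "distribution network (crossbar)": 250,
    "arbitration network": 0,      # assumed negligible
}

pe_rate_time = execution_time_nsec(2_000_000)   # two million FLOPs
total = sum(component_nsec.values())
print(pe_rate_time, total)  # 500.0 550
```

The 550 nsec component total is within ten percent of the
500 nsec implied by the assumed two million operations per
second.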


C. DATA FLOW SOFTWARE DEFINITION 

The Petri net models of data flow programs were
quantified in terms of the amount of inherent parallelism
available for exploitation at each discrete time step as
well as in terms of the implicit data dependencies of the
programs. (As previously mentioned, the data dependencies
define the control flow of a program.) The initial approach
involved taking sample programs written in the high level
language (hll) VAL and converting them to their equivalent
Petri net representations for subsequent execution on the
data flow hardware models. The problem with this approach
was that the compilation process is not yet developed. Thus,
what hardware instructions would be required for each hll
instruction were not determinable.

The subsequent approach, which was utilized, involved
designing Petri net program models characterized by various
but discretely definable levels of inherent parallelism.
Executing such artificial programs conceivably produced more
informative results than would have been obtained with a few
select programs which may have only demonstrated data flow’s
suitability for those special purpose computations. The
individual programs shall be characterized after introducing
a new concept.

A new concept introduced at this point is that of a
software concurrency vector. A concurrency vector is a
tuple, each entry of which defines the amount of inherent
parallelism in a program at the operation packet
hierarchical level, at a discrete instruction execution time
step. Each entry of the tuple is implicitly subscripted by
the time step it describes. For example, the simple
statistics function of section II.C (see figure II.C.2, page
32) would be characterized by the concurrency vector:
(4,2,2,2,1,1). In this example concurrency vector, the "4"
represents the fact that the four operations "+", "SQ",
"SQ", and "SQ" could be processed in parallel during the
first time step of execution of the simple statistics
function. This is so because no sequencing constraints exist
among these four operations. Thus the concurrency vector
defines how many operation packets could be processed in
parallel if all the instructions (i.e. the functions
addition, subtraction, division, square, square root) were
implemented in hardware. (If they were not, the
subfunctional operation packets required by the instruction
would be considered in defining the concurrency vector
entries.) It should also be recognized that the concurrency
vector, though a function of a program, is dependent upon a
standard instruction execution time duration. If the
hardware is implemented such that execution time is a
function of the instruction type, then the concurrency
vector entries could be described at an even lower level
than the operation packet level. Such a level would
correspond to a basic hardware cycle time, where executing
an instruction would require more than one hardware cycle to
complete. This additional complexity need not be considered
in this research in view of the hardware design approach
taken, but could be accommodated by the R-S methodology used
here.
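
One way to derive a concurrency vector from a program’s data
dependencies is sketched below in Python. The sketch assumes
a standard instruction execution time and enough PE’s that
every operation fires in the first time step after all of
its operands are available. The dependency table is a
hypothetical example, not the statistics function of figure
II.C.2, and the name concurrency_vector is likewise
hypothetical.

```python
def concurrency_vector(deps):
    """Count the operations that can fire at each time step.

    deps: operation -> list of operations whose results it needs
          (must be non-empty and free of cycles)
    """
    level = {}

    def step_of(op):
        # An operation fires one step after its latest predecessor.
        if op not in level:
            level[op] = 1 + max((step_of(p) for p in deps[op]), default=0)
        return level[op]

    for op in deps:
        step_of(op)
    counts = [0] * max(level.values())
    for step in level.values():
        counts[step - 1] += 1
    return tuple(counts)


# Hypothetical program: (a + b) * (c - d) + sqrt(e)
deps = {
    "add": [],
    "sub": [],
    "sqrt": [],
    "mul": ["add", "sub"],
    "sum": ["mul", "sqrt"],
}
cv = concurrency_vector(deps)
print(cv)  # (3, 1, 1)
```

The first entry is 3 because the addition, subtraction and
square root have no sequencing constraints among them.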

Four programs ("A" through "D") were utilized in this
research. These programs are differentiable by their length
as well as by the amount of inherent parallelism available
for exploitation at each time step. The Petri net
representations of these programs are shown in figures
IV.C.1 through IV.C.4. Additionally, the concurrency vector
for each is shown. The program mixes for this experiment
included one of each of the three programs, "A", "B", and
"D". Another program mix including one of each of the four
programs, "A" through "D", was also used.

The following section presents the procedure utilized in
executing the experiment. Additionally, the method of
mapping the program mixes onto each of the hardware
configurations is explained.

FIGURE IV.C.3: (MODEL) PROGRAM "C"
CV: (6,6,5,5,4,4,3,3,2,2)


D. PROCEDURE/IMPLEMENTATION 

The four procedural steps utilized in executing this
experiment were as follows. First, Petri net models of both
the hardware configurations (figure IV.B.1) and software
programs (figures IV.C.1 through IV.C.4) were converted to a
format acceptable as input to the Requester-Server (R-S)
program. Two Pascal programs, compiled and executed on the
NPS "B-side" PDP-11 (a UNIX-based system), facilitated the
(separate) generation of the hardware and software portions
of the input files for the R-S program. Each input file was
formed by concatenating the hardware and software portions
and then editing the resulting file to define the dynamic
execution desired. The second step was to transfer each
complete input file from the NPS "B-side" to the NPS
"A-side" PDP-11 (an RSX-11M-based system), via an
inter-processor link. Thirdly, the R-S program was run on
the "A-side", taking as input the file which defined the
hardware, software, and dynamic execution desired. Fourth
and finally, data regarding the execution of the software on
the hardware was obtained from the output file generated by
the R-S program. The results of the data analysis are
presented and discussed in section V (Results and
Discussion).

The implementation portion of this section addresses the
technique used to map the software onto the hardware in such
a way as to effectively simulate this function as it might
be done on a real data flow machine. The first set of
experimental "runs", which consisted of the separate running
of each program on each hardware configuration, was
straightforward in implementation. The procedure described
above, in which each program file portion was concatenated
with the appropriate hardware file portion, achieved a
relevant mapping for modelling a single process running on a
particular hardware configuration. The subsequent set of
experimental "runs", in which a program mix was mapped onto
each of the hardware configurations, was not so
straightforward to implement, as shall be explained next.

To understand the mapping of software (i.e., processes)
onto data flow hardware, it is helpful to scrutinize the
functions of the operating system for such a machine.
Because the scheduling and synchronization of concurrent
activities are built in at the hardware level, a data flow
machine's operating system will only be responsible for
initialization, termination, and input/output (I/O) of
processes. Once a process is mapped onto the data flow
machine, it runs to completion without further intervention
by the operating system (except for I/O). The question which
must be answered is: When should another process be mapped
onto a machine which is already executing one or more
processes? Thus, in defining the input file of a program mix
representative of ready processes, the program mix had to be
defined in terms of a mapping function.

The mapping functions can be thought of as operating
system assignment policies. Thus, for those runs involving
program mixes (as opposed to single programs), an assignment
policy had to be simulated. One aspect of this research, then,
can be viewed as an investigation of different policies for
mapping processes onto data flow machines in a
multiprogramming environment.

Each program mix consisted of the three programs "A",
"B" and "D". (A later "run" for which data was gathered
utilized the four-program mix consisting of one each of the
programs "A" through "D".) Each mix was varied in the way in
which it was mapped onto the hardware, in simulating
different operating system mapping functions. The operating
system assignment policies for mapping a program mix onto
the hardware configurations follow. Three policies were
simulated. First, the three programs were permitted to begin
execution at the same time. Second, an "80% Rule" was
simulated in which an additional program was permitted to
begin execution whenever the hardware utilization dropped
below 80%. Third, an "intelligent" assignment policy was
implemented via a mapping function based on the programs'
concurrency vectors. This assignment policy, it was
envisioned, would cause optimal performance in terms of the
performance indices: response time and hardware utilization.
The concurrency vector approach optimizes the assignment of
processes onto the machine by fitting together concurrency
vectors of ready processes in such a way that the objectives
noted in section III are achieved. For example, given a
machine with eight PE's, the concurrency vectors would be
fitted as shown in figure IV.D.1.


JOB "A": (4,3,2,1,2,3,4)
JOB "B": [vector illegible in source]
JOB "C": [vector illegible in source]
JOB "D": [vector illegible in source]
TOTAL:   [vector illegible in source]


FIGURE IV.D.1: AN EXAMPLE OF “FITTING” CONCURRENCY 
VECTORS TOGETHER 

By enumerating the concurrency vectors at compile time, the
program can declare beforehand those resources (as a
function of time) needed for execution as well as when the
program will be completed. An operating system can thus
choose the sequencing of the running of the waiting
processes to achieve the best fit and so best meet the
objectives of section III.
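
The fitting illustrated in figure IV.D.1 can be sketched in a
few lines of Python. This is a minimal illustration and not the
mapping function used in the experiment: it assumes a greedy
first-fit rule (each job, taken in queue order, starts at the
earliest step at which its whole vector fits under the PE limit)
and assumes that a running job always receives the PE's its
vector declares. The function names and the jobs "X" and "Y" are
hypothetical; only the vector for job "A" is taken from the
figure.

```python
def earliest_fit(load, cv, num_pes):
    """Return the first start step at which cv fits on top of load."""
    start = 0
    while True:
        fits = all(
            (load[start + i] if start + i < len(load) else 0) + need <= num_pes
            for i, need in enumerate(cv)
        )
        if fits:
            return start
        start += 1


def fit_vectors(cvs, num_pes):
    """Greedy first-fit of concurrency vectors (assumed policy).

    cvs maps job name -> concurrency vector (PE's needed per time step).
    Returns (start step per job, summed PE demand per time step).
    """
    load, starts = [], {}
    for name, cv in cvs.items():
        if max(cv) > num_pes:
            raise ValueError(f"job {name} never fits on {num_pes} PE's")
        start = earliest_fit(load, cv, num_pes)
        # Grow the load profile so the job's last step is covered.
        load.extend([0] * (start + len(cv) - len(load)))
        for i, need in enumerate(cv):
            load[start + i] += need
        starts[name] = start
    return starts, load


jobs = {
    "A": (4, 3, 2, 1, 2, 3, 4),   # from figure IV.D.1
    "X": (4, 4, 4, 4),            # hypothetical
    "Y": (2, 3, 4, 5, 4, 3, 2),   # hypothetical
}
starts, total = fit_vectors(jobs, num_pes=8)
print(starts)
print(total)
```

With eight PE's, "X" can start alongside "A" while "Y" is held
back two steps, and the summed load never exceeds eight. A search
over job orderings would be needed to approach the optimization
described above; first-fit is only the simplest stand-in.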

The results of these experimental runs are presented in
the following section in graphical form. Additionally, the
meaning and significance of the results are discussed.


V. RESULTS AND DISCUSSION


In response to the first part of this research's
hypothesis, it is proposed that the Petri net-based
Requester-Server (R-S) methodology is indeed a desirable
tool with which to predict the performance of the diverse
designs of data flow architectures. The ability to
separately specify the hardware, software and resource
allocation policy was an R-S feature which permitted
efficient generation of the combinations of the above three
items. The ability to easily implement variable levels of
detail in both the hardware and software was not exploited
but the method for doing so was introduced. Finally, the R-S
methodology's capability of simulating concurrency and
asynchronous behavior in both the hardware and software is a
necessity for accurately modelling and simulating data flow
computing.

The results which address the second part of this
research's hypothesis are now presented. To begin, figure
V.1 shows individual program execution times as a function
of the number of processing elements (PE's). These absolute
execution times were used as a basis for comparison with the
results from the multiprogramming environment runs. Percent
hardware utilization is displayed adjacent to each data
point. The hardware utilization values are averages of the
hardware utilizations at each time step during execution.
The graph shows that a program's execution time can be
drastically reduced by increasing the number of processing
resources (PE's) available up to the point where execution
time is bounded by the amount of inherent parallelism
available for exploitation in the program.
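
The bound described above, and the averaging used for the
utilization figures, can be illustrated with a small Python
sketch. It is a simplification and not the R-S simulation: it
assumes each entry of a concurrency vector is a batch of
unit-time operations that must all finish before the next entry
begins. The function name is hypothetical, and the vector is that
of job "A" in figure IV.D.1, so the numbers printed are not those
of figure V.1.

```python
from math import ceil


def run_alone(cv, num_pes):
    """Return (execution time, average utilization) for one program.

    Assumed model: level k of the concurrency vector holds cv[k]
    unit-time operations and takes ceil(cv[k] / num_pes) time steps.
    """
    per_step = []  # fraction of PE's busy at each time step
    for demand in cv:
        remaining = demand
        for _ in range(ceil(demand / num_pes)):
            busy = min(remaining, num_pes)
            per_step.append(busy / num_pes)
            remaining -= busy
    # Average utilization = mean of the per-step utilizations.
    return len(per_step), sum(per_step) / len(per_step)


cv_a = (4, 3, 2, 1, 2, 3, 4)
for pes in (1, 2, 4, 8, 16):
    time, util = run_alone(cv_a, pes)
    print(f"{pes:2d} PE's: time={time:2d}  utilization={util:.0%}")
```

Under these assumptions the execution time stops improving once
the number of PE's reaches the largest entry of the vector (four
in this example), while average utilization keeps falling as
further PE's are added.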

The following results pertain to the running of the
program mixes in a simulated multiprogramming environment.
Initially, it was intended that program mixes would contain
a greater quantity of programs than were actually run. This
was not achieved due to time constraints. Accordingly, the
results should be considered preliminary in nature. On a
positive note, the results provide insight into several data
flow operation issues. Figure V.2 provides the raw data for
this research with the exception of the data utilized in
computing hardware utilization. Figures V.3, V.4, and V.5
present the hardware utilization (as a function of time) for
the 4-, 8-, and 16-PE configurations. Similar graphs for the
1- and 2-PE configurations are presented in figure V.6 (note
the identical nature).

The implications supported by this data follow. They are
not definitive because of the small quantity of programs and
program mixes in the model. Thus, while the second part of
the hypothesis may not have been adequately tested, the
methodology for doing so appears to be available in the R-S
tool. Additionally, it is maintained that the initial
results support the discussion which follows.

Whenever the amount of concurrency in all running
processes exceeds the number of PE's available to meet the
requirements of the processes, some slowdown in execution
time results for some processes. The dual of this result is
that, so long as there are adequate PE's available, no
slowdown in any process' execution time results.

Under the All Begin Together (ABT) assignment policy,
the data flow hardware becomes "overloaded", resulting in
the slowdown just described. For example, program "A",
though the first to begin processing under the ABT scheme,
is the last to finish under the three-program mix. Under the
four-program mix with 16 PE's, the "C" program takes longer
only because of its great length and inherent parallelism.
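
The overloading effect, and the difference between the ABT policy
and the 80% Rule defined in section IV.D, can be illustrated with
a toy simulator. This is a sketch under stated assumptions and
not the R-S program: each concurrency vector entry is treated as
a batch of unit-time operations, free PE's are handed out one at
a time in round-robin order among the running programs (an
arbitrary fairness rule), and under the 80% Rule one waiting
program is admitted after any step whose utilization fell below
the threshold. The function name and program "X" are
hypothetical.

```python
def simulate(cvs, num_pes, threshold=None):
    """Run programs (name -> concurrency vector) on num_pes PE's.

    threshold=None : all programs start together (ABT).
    threshold=0.8  : one program starts; another is admitted after
                     any step with utilization below the threshold.
    Returns ({name: finish step}, [utilization at each step]).
    """
    waiting = list(cvs)
    level, left, finish, util = {}, {}, {}, []

    def admit():
        name = waiting.pop(0)
        level[name], left[name] = 0, cvs[name][0]

    admit()
    while threshold is None and waiting:
        admit()

    while len(finish) < len(cvs):
        active = [n for n in level if n not in finish]
        grant = dict.fromkeys(active, 0)
        free = num_pes
        # Hand out PE's one at a time, round-robin (assumed rule).
        while free and any(grant[n] < left[n] for n in active):
            for n in active:
                if free and grant[n] < left[n]:
                    grant[n] += 1
                    free -= 1
        for n in active:
            left[n] -= grant[n]
            if left[n] == 0:              # current level finished
                level[n] += 1
                if level[n] == len(cvs[n]):
                    finish[n] = len(util) + 1
                else:
                    left[n] = cvs[n][level[n]]
        util.append((num_pes - free) / num_pes)
        if waiting and util[-1] < threshold:
            admit()
    return finish, util


mix = {"A": (4, 3, 2, 1, 2, 3, 4), "X": (4, 4, 4, 4)}
print(simulate(mix, 8)[0])                  # enough PE's: no slowdown
print(simulate(mix, 4)[0])                  # ABT on an overloaded machine
print(simulate(mix, 4, threshold=0.8)[0])   # 80% Rule
```

On four PE's both programs finish later than they would alone
(7 and 4 steps respectively), which is the slowdown described
above; on eight PE's neither is delayed. The toy numbers are not
a reproduction of the data in figure V.2.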

Under the 80% Rule assignment policy, the average
hardware utilization is lower than that under the ABT policy
(for the three-program mix). Also, the average of the three
programs' execution times is lower than that under the ABT
policy. For the four-program mix, the average hardware
utilization is slightly greater. This reflects a better
mapping of processes onto the machine. (Average hardware
utilization is defined as the average, over the duration of
a run, of the hardware utilization percentages at each time
step of that run.)

Under the optimized concurrency vector (CV) approach,
programs were mapped onto the hardware configurations in
such a way as to achieve a high hardware utilization at each
time step as well as minimize the average response time of
the programs in the mix. The results indicate average
hardware utilizations at least as high as under either of
the other assignment policies. Also, the average response
times were at least as low as under either of the other
assignment policies. This optimized concurrency vector
approach should be suitable for machine optimization. Using
concurrency vectors generated at compile time, the mapping
of additional processes onto the data flow machine should
probably continue only so long as acceptable average
response time for any process is not exceeded. When a
process characterized by more inherent parallelism than can
be currently accommodated on the data flow machine is
awaiting assignment (i.e., mapping onto the data flow
machine), that process' assignment should be delayed until
sufficient (or, if necessary, all) PE's are available to
parallel process the computation. (A user advisory denoting
such a delay would be highly desirable.)
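
The delay rule suggested above can be sketched as a simple
admission check. This is an illustration only, under the
conservative assumption that an admitted process reserves the
peak of its concurrency vector for its whole run; the function
name, the advisory wording, and the processes "W" and "S" are
hypothetical.

```python
def admit_or_delay(waiting, free_pes, num_pes):
    """Split waiting processes into those admitted now and those delayed.

    waiting  : list of (name, concurrency vector) pairs in queue order
    free_pes : PE's not currently reserved by running processes
    num_pes  : total PE's in the machine
    Returns (admitted names, advisory messages for delayed processes).
    """
    admitted, advisories = [], []
    for name, cv in waiting:
        # A process wider than the machine waits for all PE's, not more.
        needed = min(max(cv), num_pes)
        if needed <= free_pes:
            admitted.append(name)
            free_pes -= needed
        else:
            advisories.append(
                f"{name}: delayed, needs {needed} PE's, {free_pes} free"
            )
    return admitted, advisories


queue = [("A", (4, 3, 2, 1, 2, 3, 4)), ("W", (12, 12)), ("S", (2, 2))]
admitted, advisories = admit_or_delay(queue, free_pes=6, num_pes=8)
print(admitted)
print(advisories)
```

Reserving the peak is wasteful for vectors that are wide only
briefly; fitting whole vectors against the projected load, as in
figure IV.D.1, would admit more work at the cost of more
preprocessing.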

The time spent in preprocessing jobs in accordance with
any assignment scheme to achieve some level of optimization
may be unnecessary or even wasteful. This trade-off will
have to be examined in greater depth.


VI. SUMMARY


Following a review of the pertinent literature, a
two-part hypothesis was proposed. First, the Petri net-based
Requester-Server (R-S) methodology's suitability for
predicting the performance of data flow machines was to be
tested. Second, it was hypothesized that the goal of
economically achieving higher speed computation through data
flow computing would be unattainable without achieving a
high and [illegible] degree of multiprogramming. The R-S
methodology, a simulation technique, permits the separate
specification of the hardware to be evaluated, the software
to be used in the hardware evaluation, and the policy for
allocating hardware resources to program requests for
service. Accordingly, Petri net models of data flow hardware
configurations were quantified in terms of their execution
in time, and Petri net models of data flow programs were
quantified in terms of the amount of inherent parallelism
available for exploitation at each discrete time step, as
well as in terms of the implicit data dependencies of the
program. Model programs were "run" on model hardware
configurations. Results obtained from the monitor function
of the R-S program were analyzed with respect to the
performance indices: hardware utilization and response time.

Three assignment policies for determining when to map
additional programs onto a data flow machine were tested:

     1. all programs begin together

     2. assign an additional program whenever the
        hardware utilization drops below 80%

     3. assign an additional program based on a
        concurrency vector.

Results show that the R-S methodology is indeed an
efficient and easy-to-use tool for investigating data flow
architectures. Also, initial results indicate that optimized
scheduling based upon concurrency vectors is viable for
deciding when to map additional processes onto a data flow
machine to achieve the objectives of maintaining high
hardware utilization and providing acceptable average
response time.


VII. RECOMMENDATIONS FOR FURTHER RESEARCH


With regard to the methodology, worthwhile additions to
the R-S program would be user-friendly "front-" and
"back-ends" which would further simplify both the generation
of input files for the R-S tool and the retrieval of desired
data from the output file generated by each run.

In the area of data flow, simulations in which the
hardware definition of the architecture was varied (as
described in section IV.B) could provide insights regarding
the optimal hardware configuration for the expected program
load. In particular, the quantity of (modelled) PE's should
be increased to a number closer to the amount expected in
the actual machine (approximately 512). Of course, in order
to model more accurately, the expected load in terms of the
quantity of programs and typical amounts of inherent
parallelism will have to be defined more exactly.

A final open area within data flow research is the
development and testing of specific algorithms using
concurrency vectors to permit machine optimization and
implementation of a desirable assignment policy.